By the end of the lab, you will be able to …
Download and open code-along-02.qmd
Load the standard packages.
Install and load the summarytools package.
Operators in R are symbols that direct R to perform mathematical, logical, and assignment operations. A few of the key ones to know before we get started:
To test equality or inequality:
==, !=, >, >=, <, <=
To indicate “and”, “or”, and “not”:
& | !
To assign values to data objects: <- -> =
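A quick sketch of these operators in action:

```r
# comparison operators return TRUE or FALSE
5 == 5        # TRUE
3 != 4        # TRUE
2 >= 3        # FALSE

# logical operators combine TRUE/FALSE values
TRUE & FALSE  # FALSE ("and")
TRUE | FALSE  # TRUE  ("or")
!TRUE         # FALSE ("not")

# all three lines store 10 in an object
x <- 10
10 -> y
z = 10
```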
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
You can access the variables (i.e., columns) using the $ operator, as shown using the table() function.
The variable names are case sensitive. In this dataset, all variables are lowercase.
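A minimal sketch, assuming the data frame is named gss24 (as it is later in this lab):

```r
# a function (verb) applied to its object in parentheses
nrow(gss24)        # number of rows, i.e., respondents

# $ pulls one column (variable) out of the data frame;
# table() then counts respondents at each value
table(gss24$sex)
```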
195 respondents were coded as 1 on this variable. What does that mean?
dplyr grammar
What’s the advantage of dplyr grammar? We can sequence data manipulation!
# A tibble: 2 × 10
sex variable mean sd min med max n.valid n pct.valid
<dbl+lbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 [male] sex 41.7 13.7 0 40 89 869 1467 59.2
2 2 [female] sex 37.3 13.7 0 40 89 891 1823 48.9
https://sta199-f24.github.io/slides/03-grammar-of-data-transformation-slides.html#/the-pipe
Political polarization is high in the U.S. today and attitudes about gender and family behavior have been heavily debated.
Using the most recent survey, do more liberals than conservatives think sex before marriage is ‘not wrong at all’?
How do we find out?
Let’s familiarize ourselves with the premarsx and polviews variables.
In the console, type ?premarsx and hit enter. The Help pane will show you the question text, response options and values.
Now, do the same for polviews.
Run this code to see the frequency table for the premarsx variable. Then, add a line below to also see a table for the polviews variable.
The table() function also lets you create a table with two variables.
Use haven::as_factor to see the value labels instead of the value numbers. Then, do the same for polviews.
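A sketch of that step, assuming haven is loaded:

```r
# as_factor() shows the text labels instead of the numeric codes
table(haven::as_factor(gss24$premarsx))
table(haven::as_factor(gss24$polviews))
```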
always wrong almost always wrong
357 122
wrong only sometimes not wrong at all
258 1378
other iap
0 1126
don't know I don't have a job
50 0
dk, na, iap no answer
0 6
not imputable refused
0 0
skipped on web uncodeable
12 0
not available in this release not available in this year
0 0
see codebook
0
extremely liberal liberal
140 421
slightly liberal moderate, middle of the road
368 1148
slightly conservative conservative
381 516
extremely conservative don't know
186 99
iap I don't have a job
0 0
dk, na, iap no answer
0 20
not imputable refused
0 0
skipped on web uncodeable
30 0
not available in this release not available in this year
0 0
see codebook
0
Let’s clean up the levels for premarsx.
gss24$premarsx <- zap_missing(gss24$premarsx)  # convert user-defined missing codes to NA
gss24$premarsx <- as_factor(gss24$premarsx)    # replace numeric codes with value labels
table(gss24$premarsx)
always wrong almost always wrong wrong only sometimes
357 122 258
not wrong at all other
1378 0
Let’s get rid of the empty levels in premarsx.
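One way to do this is base R’s droplevels(), which keeps only the levels that actually occur in the data:

```r
gss24$premarsx <- droplevels(gss24$premarsx)
table(gss24$premarsx)
```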
always wrong almost always wrong wrong only sometimes
357 122 258
not wrong at all
1378
For polviews, let’s combine categories to ease interpretation. This is easiest when the levels are numeric.
Let’s remind ourselves which values correspond to each label.
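One way to see both at once, assuming haven’s as_factor() with levels = "both":

```r
# levels = "both" prefixes each label with its code, e.g. "[1] extremely liberal"
table(haven::as_factor(gss24$polviews, levels = "both"))
```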
[1] extremely liberal [2] liberal
140 421
[3] slightly liberal [4] moderate, middle of the road
368 1148
[5] slightly conservative [6] conservative
381 516
[7] extremely conservative [NA] don't know
186 99
[NA] iap [NA] I don't have a job
0 0
[NA] dk, na, iap [NA] no answer
0 20
[NA] not imputable [NA] refused
0 0
[NA] skipped on web [NA] uncodeable
30 0
[NA] not available in this release [NA] not available in this year
0 0
[NA] see codebook
0
gss24 <- gss24 |>
mutate(pol3cat = case_when(
polviews >= 1 & polviews <= 3 ~ "Liberal",
polviews == 4 ~ "Moderate",
polviews >= 5 & polviews <= 7 ~ "Conservative",
TRUE ~ NA_character_),
pol3cat = factor(pol3cat,
levels = c("Liberal", "Moderate", "Conservative"))
)
The pipe can be written as |> or %>%.
Always double check your work.
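One way to check the recode is to cross-tabulate the new variable against the original codes:

```r
# rows: original 1-7 codes; columns: the new 3-category factor;
# useNA = "ifany" keeps any missing values visible
table(gss24$polviews, gss24$pol3cat, useNA = "ifany")
```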
Make a frequency table. One of summarytools’ main purposes is to help clean and prepare data for further analysis. Pay attention to the missing values. Then, do the same for premarsx.
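A sketch of the freq() call, assuming summarytools is loaded:

```r
freq(gss24$pol3cat)
freq(gss24$premarsx)
```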
Frequencies
gss24$pol3cat
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
------------------ ------ --------- -------------- --------- --------------
Liberal 929 29.40 29.40 28.07 28.07
Moderate 1148 36.33 65.73 34.69 62.77
Conservative 1083 34.27 100.00 32.73 95.50
<NA> 149 4.50 100.00
Total 3309 100.00 100.00 100.00 100.00
Frequencies
gss24$premarsx
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
-------------------------- ------ --------- -------------- --------- --------------
always wrong 357 16.88 16.88 10.79 10.79
almost always wrong 122 5.77 22.65 3.69 14.48
wrong only sometimes 258 12.20 34.85 7.80 22.27
not wrong at all 1378 65.15 100.00 41.64 63.92
<NA> 1194 36.08 100.00
Total 3309 100.00 100.00 100.00 100.00
Setting report.nas = FALSE suppresses the rows for missing values.
The headings = FALSE parameter suppresses the heading section. Do the same for premarsx.
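Putting the two options together, a sketch:

```r
freq(gss24$pol3cat, report.nas = FALSE, headings = FALSE)
```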
Based on your table, what percentage of respondents believe sex before marriage is ‘almost always wrong’?
Based on your table, what percentage of respondents believe sex before marriage is ‘always’ or ‘almost always wrong’?
The table() function gives us the frequencies.
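A sketch of the call (rows are premarsx, columns are pol3cat):

```r
table(gss24$premarsx, gss24$pol3cat)
```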
Liberal Moderate Conservative
always wrong 32 78 229
almost always wrong 21 44 51
wrong only sometimes 58 91 100
not wrong at all 505 488 331
We want to add the column percentages…
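summarytools’ ctable() can do this; prop = "c" requests column proportions:

```r
ctable(gss24$premarsx, gss24$pol3cat, prop = "c")
```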
What’s your conclusion to our initial research question?
% who think sex relations before marriage is __________, by political views
Cross-Tabulation, Column Proportions
premarsx * pol3cat
Data Frame: gss24
---------------------- --------- -------------- -------------- -------------- ---------------
pol3cat Liberal Moderate Conservative Total
premarsx
always wrong 32 ( 5.2%) 78 ( 11.1%) 229 ( 32.2%) 339 ( 16.7%)
almost always wrong 21 ( 3.4%) 44 ( 6.3%) 51 ( 7.2%) 116 ( 5.7%)
wrong only sometimes 58 ( 9.4%) 91 ( 13.0%) 100 ( 14.1%) 249 ( 12.3%)
not wrong at all 505 ( 82.0%) 488 ( 69.6%) 331 ( 46.6%) 1324 ( 65.3%)
Total 616 (100.0%) 701 (100.0%) 711 (100.0%) 2028 (100.0%)
---------------------- --------- -------------- -------------- -------------- ---------------
Remember, the mode is the category with the greatest frequency (or the largest percentage). Let’s find it for the premarsx variable.
Frequencies
gss24$premarsx
Type: Factor
Freq % % Cum.
-------------------------- ------ -------- --------
always wrong 357 16.88 16.88
almost always wrong 122 5.77 22.65
wrong only sometimes 258 12.20 34.85
not wrong at all 1378 65.15 100.00
Total 2115 100.00 100.00
We can use the same table we generated before to identify the median. This time, let’s use dplyr grammar to produce the same table.
Remember, use the cumulative percentage to locate the 50th percentile.
dplyr grammar, starting with the name of the df and a pipe
freq() function as usual
tb() function to turn the table into a tibble
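Putting those three steps together:

```r
gss24 |>
  freq(premarsx) |>
  tb()
```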
# A tibble: 4 × 4
premarsx freq pct pct_cum
<fct> <dbl> <dbl> <dbl>
1 always wrong 357 16.9 16.9
2 almost always wrong 122 5.77 22.6
3 wrong only sometimes 258 12.2 34.8
4 not wrong at all 1378 65.2 100
Freq % % Cum.
-------------------- ------ --------- ---------
89+ hours [89] 6 0.339 0.339
[0] 10 0.566 0.905
[2] 3 0.170 1.075
[3] 4 0.226 1.301
[4] 7 0.396 1.697
[5] 9 0.509 2.206
[6] 11 0.622 2.828
[7] 2 0.113 2.941
[8] 15 0.848 3.790
[9] 4 0.226 4.016
[10] 19 1.075 5.090
[12] 12 0.679 5.769
[13] 3 0.170 5.939
[14] 3 0.170 6.109
[15] 18 1.018 7.127
[16] 10 0.566 7.692
[17] 1 0.057 7.749
[18] 4 0.226 7.975
[19] 2 0.113 8.088
[20] 58 3.281 11.369
[21] 3 0.170 11.538
[22] 5 0.283 11.821
[23] 4 0.226 12.048
[24] 21 1.188 13.235
[25] 44 2.489 15.724
[26] 5 0.283 16.007
[27] 4 0.226 16.233
[28] 10 0.566 16.799
[29] 1 0.057 16.855
[30] 68 3.846 20.701
[31] 3 0.170 20.871
[32] 32 1.810 22.681
[33] 3 0.170 22.851
[34] 8 0.452 23.303
[35] 46 2.602 25.905
[36] 29 1.640 27.545
[37] 18 1.018 28.563
[38] 17 0.962 29.525
[39] 5 0.283 29.808
[40] 697 39.423 69.231
[41] 12 0.679 69.910
[42] 23 1.301 71.210
[43] 14 0.792 72.002
[44] 15 0.848 72.851
[45] 92 5.204 78.054
[46] 15 0.848 78.903
[47] 2 0.113 79.016
[48] 21 1.188 80.204
[49] 3 0.170 80.373
[50] 143 8.088 88.462
[51] 3 0.170 88.631
[52] 7 0.396 89.027
[53] 2 0.113 89.140
[54] 5 0.283 89.423
[55] 31 1.753 91.176
[56] 5 0.283 91.459
[58] 4 0.226 91.686
[59] 2 0.113 91.799
[60] 70 3.959 95.758
[61] 1 0.057 95.814
[62] 2 0.113 95.928
[64] 2 0.113 96.041
[65] 13 0.735 96.776
[66] 1 0.057 96.833
[67] 2 0.113 96.946
[68] 2 0.113 97.059
[69] 1 0.057 97.115
[70] 22 1.244 98.360
[72] 5 0.283 98.643
[75] 3 0.170 98.812
[77] 1 0.057 98.869
[78] 1 0.057 98.925
[80] 16 0.905 99.830
[83] 1 0.057 99.887
[84] 1 0.057 99.943
[85] 1 0.057 100.000
Total 1768 100.000 100.000
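For a numeric variable like hrs1, summarytools’ descr() reports a full set of summary statistics; a sketch:

```r
descr(gss24$hrs1)
```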
summary() and descr()
Descriptive Statistics
gss24$hrs1
Label: Number of hours worked last week
N: 3309
hrs1
----------------- ---------
Mean 39.44
Std.Dev 13.87
Min 0.00
Q1 35.00
Median 40.00
Q3 45.00
Max 89.00
MAD 7.41
IQR 10.00
CV 0.35
Skewness -0.05
SE.Skewness 0.06
Kurtosis 1.54
N.Valid 1768.00
N 3309.00
Pct.Valid 53.43
descr()
# A tibble: 1 × 9
variable mean sd min med max n.valid n pct.valid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 hrs1 39.4 13.9 0 40 89 1768 3309 53.4